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Abstract 



In a previous work (Gabarro-Arpa, J. Math. Chem. 42 (2006) 691-706) a procedure was described
for dividing the 3 × N-dimensional conformational space of a molecular system into a number of
discrete cells. This partition allowed the building of a combinatorial structure from data sampled in
molecular dynamics trajectories: the graph of cells, or G, encoding the set of cells in conformational
space that are visited by the system in its thermal wandering. In this work we describe the
procedures for building from G a hypergraph that allows the basic 3D characteristics of molecular
conformations in the cells to be enumerated.

1. Introduction



The aim of this series of papers [1-6] is to build a set of mathematical tools for studying the 
energy landscape of proteins [7,8,9], and the present paper is a step further towards this goal. 

The energy surface of proteins is the essential tool for understanding the physico-chemistry of
basic biological processes like catalysis [9]. It is also a complex multidimensional structure that
can only be built from the knowledge of the complete dynamical history of the molecule, which
is currently out of reach for conventional molecular-dynamics simulations (hereafter referred to as
MDS) [9]. One reason is that in an MDS trajectory the position of every atom in the molecule
is calculated with an accuracy of a hundredth of an angstrom, which quickly overwhelms even the
most powerful computers. The main tool developed here can be described as a fluctuation
amplifier: the small movements of a molecular system, which are easily sampled with current
simulation tools, are encoded by means of a simple combinatorial structure, from which the set
of structures corresponding to realizable combinations of these movements can be generated.

Within this approach, the 3D structures of protein molecules are encoded into binary objects
called dominance partition sequences (DPS) [1-6], which generalize a combinatorial structure
known as noncrossing partition sequences [10]. They generate a linear partition of
molecular conformational space^1 (abridged to CS in what follows) into a set of connected disjoint
regions called cells, each harboring the set of 3D conformations that have the same DPS.

Partitions are a useful tool for studying multidimensional spaces. In our case they systematically
span a much wider volume range than the set of points along a random trajectory curve generated
by an MDS. They have also been used in many other contexts [7,8,11].

The aim of the preceding papers [1-6] was to construct a graph whose nodes are the cells visited
by the molecular system in its thermal wandering. Two important properties of partition sequences
make this construction possible:

1. DPSs are hierarchical structures: partition sequences encoding different sets of cells can
be merged into a new partition sequence encoding the union set, and the process can be
repeated with the new sets of cells, thus creating a hierarchy^2. The importance of this
property is that, on climbing the hierarchy ladder, the number of cells increases exponentially
while the sequence length increases only linearly. This compact coding makes possible
the construction of a graph that represents huge regions of CS and whose size does not exceed the
memory of a workstation computer, while keeping the essential information
about the molecular structures.

2. DPSs are modular structures: partition sequences can be decomposed into subsequences
that are embedded in different conformational subspaces. This allows a composition law
to be defined: if two partition sequences from two different subspaces share the same sequence
for the intersection subspace, then joining both sequences gives a realizable sequence that
corresponds to an existing set of cells [4-6].

The first property tells us that the graph can be constructed; the second suggests how to build
it: a molecular structure can be decomposed into sets of four atoms, its smallest 3D components,
and by composing the graphs of these one can build the graph of the molecule.

1 For an N-atom molecule it is a 3 × (N − 1)-dimensional space where each point corresponds to a 3D molecular
conformation.

2 A structure called a partially ordered set (poset). Posets are widely used tools in many theoretical chemistry
problems [10,12-20].

Atoms in MDSs are represented as pointlike structures surrounded by a force field [22,23]. The
convex envelope of a set of 4 points in 3D space is an irregular polytope called a 4-simplex or
simplex^3. The conformational space of these sets is relatively small, with 13824 cells, of which
only a fraction is visited by the system. With so small a CS it can plausibly be assumed that
the accessible cells are all visited during an MDS run.
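
The figure of 13824 cells can be checked with a short count. The sketch below assumes that a cell is labelled by one ordering of the four vertices along each of the three coordinate axes; the function name is illustrative only.

```python
from math import factorial

def simplex_cell_count(n_points: int = 4, n_axes: int = 3) -> int:
    """Count cells when every axis contributes one ordering of the points."""
    # n_points! orderings per axis, chosen independently on each axis.
    return factorial(n_points) ** n_axes

print(simplex_cell_count())  # 24**3 = 13824
```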

The method for building the graph that was proposed in [2] consists in 

1. Establishing a morphological classification of simplexes, where each class is defined by a set 
of geometrical constraints. 

2. The geometrical constraints that define a class allow the set of accessible cells in a simplex
CS to be calculated [4]; thus to each class we can associate a graph whose nodes are the
cells from this set, with edges towards adjacent cells.

3. On the other hand, computer simulations of protein dynamics show [2,4] that in a protein
structure the majority of simplexes evolve within a reduced number of morphologies. For
each 4-atom set in the molecule the graph of its CS is built by merging the graphs of the
visited simplex morphologies.

4. The CS graph of the molecule, which was called the graph of cells or G in [4], can be built
by composing the CS graphs of the different simplexes.

The graph of cells allows the set of visited cells in conformational space to be enumerated exactly,
but since the cells are encoded in a compact form, unwrapping them completely is probably
algorithmically hopeless. Instead, a generalization of dominance partition sequences (abridged to
GDPS in what follows) was developed in [5,6]; it can be geometrically interpreted as a
bouquet of cones in CS. Its importance lies in two facts:

1. it encloses the region of CS that harbours the dynamical states of the molecular system, 

2. it can be hierarchically factorised, thus the whole region can be decomposed as a product of 
smaller partitions of molecular conformational spaces, greatly relieving the computational 
effort involved in the enumeration of the cells from CS. 

This subject is developed in the following sections:

• Section 2 contains a graphical presentation of dominance partition sequences. 

• Section 3 discusses the factorization of the generalized dominance partition sequences. 

• Section 4 describes the procedure for merging the graph of cells and the GDPS and introduces
the concept of the partition sequences graph.

• Section 5 discusses the construction of the graph, with a detailed description of a set of 
three algorithmic procedures that make the construction possible. 



3 In what follows this denomination will be used to designate ordered sets of 4 atoms/points.



2. The Dominance Partition Sequences 



Hidden in complex objects like macromolecules there are simple structures that cannot be seen 
because they are buried under great amounts of information. However, these structures can be 
made to emerge when information is selectively eliminated from the objects [11]. 

Here the only information we keep from the 3D structure of macromolecules is the dominance
partition sequences (DPS) [1-4]. There are three such sequences, one for each Cartesian
coordinate x, y and z. For an N-atom molecular system with atoms numbered from 1 to N, the
DPS of a given coordinate c is the sequence of atom numbers sorted in ascending order of the
c coordinate of their respective atoms.
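
The definition above amounts to one sort per coordinate axis. A minimal sketch, assuming atom coordinates are given as (x, y, z) tuples and that no two atoms share exactly the same coordinate value; the function name and the 4-atom coordinates are hypothetical.

```python
def dominance_partition_sequences(coords):
    """coords[k-1] holds the (x, y, z) position of atom number k.

    Returns a dict mapping each axis name to the list of atom numbers
    sorted in ascending order of that coordinate.
    """
    atoms = range(1, len(coords) + 1)
    return {
        name: sorted(atoms, key=lambda k: coords[k - 1][axis])
        for axis, name in enumerate("xyz")
    }

# Hypothetical 4-atom conformation (arbitrary units).
coords = [(0.3, 1.2, -0.5), (1.1, 0.4, 0.9), (-0.7, 0.8, 0.1), (0.5, -0.2, 0.6)]
dps = dominance_partition_sequences(coords)
print(dps)  # {'x': [3, 1, 4, 2], 'y': [4, 2, 3, 1], 'z': [1, 3, 4, 2]}
```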

A simple example of DPS can be extracted from Fig. 1, where an α-carbon skeleton 3D
conformation from the pancreatic trypsin inhibitor (PTI) [24] is shown.

Figure 1

α-carbon skeleton stereoview of the pancreatic trypsin inhibitor [24].

As can be easily seen from Fig. 1, the (x, y, z)-dominance partition sequences of the protein
conformation are

{{(58)(49)(29)(48)(57)(27)(28)(31)(30)(52) (32)(47)(53)(50)(19)(26)(21)(56)(51)(24)
(33)(20)(23)(55)(46)(25)(22)(34)(54)(18) (1)(45)(17)(5)(6)(35)(44)(2)(8)(43)
(16)(9)(11)(7)(3)(10)(36)(4)(37)(42) (15)(12)(41)(40)(14)(38)(13)(39)}_x,

{(15)(16)(17)(14)(18)(37)(36)(13)(19)(38) (34)(35)(12)(11)(39)(20)(33)(46)(10)(40)
(32)(47)(21)(45)(44)(9)(48)(31)(41)(22) (49)(50)(43)(42)(8)(51)(30)(23)(24)(52)
(7)(29)(54)(53)(5)(27)(55)(26)(25)(4) (6)(28)(57)(56)(58)(3)(2)(1)}_y,

{(26)(27)(10)(8)(25)(7)(24)(11)(6)(9) (12)(28)(13)(33)(34)(15)(31)(32)(29)(17)
(14)(23)(36)(41)(35)(3)(40)(5)(22)(16) (30)(4)(39)(43)(1)(18)(21)(19)(42)(20)
(44)(38)(2)(37)(55)(48)(45)(51)(52)(57) (56)(47)(46)(54)(49)(53)(58)(50)}_z}    (1)

DPSs like (1) generate an equivalence relation: two 3D conformations are equivalent if they
have the same dominance partition sequence. Further, for an N-atom molecular system DPSs
generate a partition of the (3 × N − 3)-dimensional molecular conformational space into cells whose
points (3D conformations) all have the same DPS. This partition is known to combinatorialists
as a Coxeter reflection hyperplane arrangement, and for an N-atom molecule it is designated as
A_{N-1} × A_{N-1} × A_{N-1} [25,26].

For clarity we have only taken into consideration the α-carbon atoms from the protein.
Notice that this does not matter much, since the procedures used throughout this work are strictly
modular and the results obtained for parts or components are also valid for the whole molecule.

As has been extensively discussed in [10,26], these sequences have interesting combinatorial
properties. Suppose we have two molecular conformations in which, for some coordinate axis, two atoms,
say 3 and 10, get past each other. Obviously these two conformations will have DPSs (encoding
(N − 1)-dimensional cells in CS [4]) that differ in only two positions: {...(3)(10)...}_c and
{...(10)(3)...}_c respectively.

These can be aggregated in a new sequence {...(3 10)...}_c representing the permutations of 3
and 10 and encoding an (N − 2)-dimensional cell in CS [4,6]. More generally, a DPS with a
sequence of n atom numbers enclosed in parentheses, {...(i_1 i_2 ... i_{n-1} i_n)...}_c, represents the set
of n! DPSs corresponding to the permutations of the indices i_1, ..., i_{n-1}, i_n and encodes an
(N − n)-dimensional cell.
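
The correspondence between an aggregated sequence and the n! plain DPSs it stands for can be made concrete with a small sketch. It assumes an aggregated DPS is stored as a list of tuples, one tuple per parenthesised group; the name `expand` and the 4-atom example are illustrative only.

```python
from itertools import permutations, product

def expand(aggregated):
    """Yield every plain DPS represented by an aggregated DPS.

    aggregated: list of tuples, e.g. [(1,), (3, 10), (7,)] for {(1)(3 10)(7)}.
    Each group of n indices contributes its n! internal orderings.
    """
    per_group = [list(permutations(group)) for group in aggregated]
    for choice in product(*per_group):
        yield [atom for group in choice for atom in group]

seqs = list(expand([(1,), (3, 10), (7,)]))
print(seqs)  # [[1, 3, 10, 7], [1, 10, 3, 7]]
```

A group of three indices, such as `[(1, 2, 3)]`, expands in the same way into 3! = 6 sequences.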

3. Generalized Dominance Partition Sequences 

To analyse molecular simulations with this procedure, DPS codes need to be further generalized
in order to handle more complex situations. The sequence below is a valid example of the
generalization we try to achieve

{(49 48 (29 27 28) 30 31 52)}_c    (2)

where (2) encloses {(49 48 29 27 28)(30 31 52)}_c and {(49 48)(27 28 30 31 52)}_c as subsequences.
This means, for instance, that (2) encodes a set of cells from CS where the c-coordinates of the atom
pairs 27 and 48, and 27 and 31, can be permuted, but not those of the pair 48 and 31.
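
The statement about which pairs can be permuted can be checked mechanically from the two enclosed sequences quoted above. The sketch below assumes that two atoms can be permuted exactly when they share a parenthesised group in at least one enclosed sequence; the function and variable names are hypothetical, and the groups are copied as printed in the text.

```python
def can_permute(a, b, enclosed):
    """True if atoms a and b share a group in at least one enclosed sequence."""
    return any(a in group and b in group for seq in enclosed for group in seq)

# The two sequences enclosed by the generalized sequence (2), as quoted in the text.
ENCLOSED_IN_2 = [
    [(49, 48, 29, 27, 28), (30, 31, 52)],
    [(49, 48), (27, 28, 30, 31, 52)],
]

print(can_permute(27, 48, ENCLOSED_IN_2))  # True
print(can_permute(27, 31, ENCLOSED_IN_2))  # True
print(can_permute(48, 31, ENCLOSED_IN_2))  # False
```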

It was shown in [6] that the DPSs from the conformations generated in a molecular dynamics
trajectory of the PTI protein [27], like (1), are all subsequences of the generalized DPS

{{(^1 49 (^2 48 (^3 29 (^4 27 28 30 (^5 31)^1 52)^2 47 (^6 32 (^7 53)^3 50)^4 (^8 26)^5 51)^6 21 23 24
(^9 19 (^10 20 (^11 25 33)^7 46 55 (^12 54)^8 22)^9 18 34)^10 45)^11 (^13 17)^12 (^14 5 44 (^15 8
(^16 6)^13 35 43 (^17 9)^14 16 (^18 11)^15 7)^16 36 (^19 3 4 (^20 10)^17 42 (^21 37)^18 (^22 15)^19
(^23 12)^20 (^24 41)^21 40)^22 38 (^25 14)^23 13)^24 39)^25}_x,

{(^1 15 16 (^2 17)^1 (^3 14)^2 (^4 18)^3 (^5 36 (^6 13)^4 37)^5 (^7 19 (^8 34)^6 12 35 38)^7 (^9 11)^8
20 33 (^10 39)^9 (^11 46 (^12 10)^10 32 40 47)^11 (^13 21)^12 (^14 45)^13 44 (^15 31 (^16 9 48)^14
41)^15 (^17 22)^16 (^18 42 49 50)^17 8 30 43 51)^18 (^19 23 (^20 24)^19 (^21 7 52 (^22 29 (^23 4 53
54)^20 (^24 26 27)^21 5)^22 6 25 28 55)^23 3)^24}_y,

{(^1 26 (^2 27 (^3 8 10 (^4 7 (^5 25 (^6 11 (^7 13)^1 9)^2 6 24 (^8 12 (^9 28)^3 33)^4 31)^5 (^10 34
(^11 15)^6 32 (^12 29 (^13 17)^7 (^14 14)^8 5 23 (^15 4 22 35 36 40 41 (^16 3)^9 30)^10 39)^11 16 21
43)^12 (^17 38)^13 18 19)^14 20)^15 37 42 44)^16 (^18 55)^17 (^19 45 48 (^20 52)^18 51)^19
(^21 46 47)^20 54 (^22 49 53)^21 50)^22}_z}    (13)

where the superscripts reproduce the numbers printed above the parentheses, which number the opening
and the closing parentheses separately in order of appearance.

A graphical, more intuitive form of (13) can be seen in Fig. 2. Also, (13) can be thought of as
a pattern: all the DPSs from the protein dynamical conformations match (13). However, a great
number of DPSs matching (13) do not correspond to any dynamical conformation.

Thus, the problem addressed in this paper is: what subset of the DPSs matching (13) corresponds to
the molecule's dynamical states?



Figure 2

The generalized partition sequence (13) in graphical form, with the permutation sequences enclosed in squares.



4. Merging the Graph of Cells with the GDPS 



While (13) contains approximate information about the whole system, the graph of cells G 
contains exact information on the system fragments. Thus, the problem above could be solved by 
merging both. 

To do this we must decompose (13) into its component DPSs, first by enumerating
the complete set of non-intersecting sequences of permutations^4. These can be extracted with a
three-step recursive procedure from the graph in Fig. 3, whose nodes are the sets of permutations
with forward edges towards the adjacent non-intersecting permutation sets:

1st by taking the paths that go through non-decreasing node numbers, starting at a node with 
no backward links and ending at one with no forward links, 

2nd the DPSs are formed by discarding in (13) the permutations that are not in the path, 

3rd the intervening sequences inherit the structure from (13) restricted to them; this again
generates GDPSs and the procedure has to be applied recursively to each of them.

Pairs of adjacent permutations, however, are far less complex GDPSs than (13) and decomposing 
them into simple sequences is much less computationally demanding. 
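
The first step of the procedure above, taking every path that starts at a node with no backward links and ends at a node with no forward links, can be sketched as follows. The toy graph is hypothetical (it is not the graph of Fig. 3), and the sketch assumes that every forward edge leads to a higher-numbered node, so that node numbers never decrease along a path.

```python
def maximal_paths(forward):
    """Enumerate source-to-sink paths in a graph of forward edges.

    forward: dict mapping a node number to the list of its successors.
    Returns the list of paths, each path being a list of node numbers.
    """
    nodes = set(forward) | {v for succ in forward.values() for v in succ}
    with_backward_link = {v for succ in forward.values() for v in succ}
    sources = sorted(nodes - with_backward_link)
    paths = []

    def walk(path):
        successors = forward.get(path[-1], [])
        if not successors:          # no forward link: the path is complete
            paths.append(path)
            return
        for nxt in successors:
            walk(path + [nxt])

    for source in sources:
        walk([source])
    return paths

toy_graph = {1: [2, 3], 2: [4], 3: [4], 4: []}
print(maximal_paths(toy_graph))  # [[1, 2, 4], [1, 3, 4]]
```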

The graph from Fig. 3 suggests how to obtain the DPSs representing dynamical conformations
from (13) and the graph of cells G:

1. We arbitrarily consider only one of the axes from Fig. 3, x for instance. The result would be
the same with any other axis, because (13) is used only as a guideline for aggregating
DPSs from the simplexes in G.

2. In the graph of Fig. 3 we take in succession the linked pairs of permutation sets, say 𝒫_i
and 𝒫_j, together with the intervening sequence ℐ_ij in between, and form the x-dominance
partition sequence

{(𝒫_i)(ℐ_ij)(𝒫_j)}_x    (14)

Let 𝒩_ij = {n_1, n_2, n_3, ...} be the set of atom numbers in (14). For y and z we simply reduce
(13)_y and (13)_z by eliminating the indices that are not in the set 𝒩_ij; this gives a GDPS.

3. Similarly, we eliminate from G the simplexes with vertex numbers that are not in 𝒩_ij, and
from the CS of each remaining simplex we eliminate the cells whose DPS does not match
the GDPS obtained in step 2. We call the resulting graph G_{𝒩_ij}.

4. As described in [5], the (N − 1)-dimensional cells from G_{𝒩_ij} are aggregated into a smaller
number of lower-dimensional cells, giving the final compact form C_{𝒩_ij} of the reduced graph
of cells.
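
The index elimination used in steps 2 and 3 is a plain restriction of a sequence to a subset of atom numbers. A minimal sketch, assuming an aggregated sequence is stored as a list of tuples (one per parenthesised group); the function name and the chosen index set are hypothetical, while the sequence is the beginning of the y-DPS in (1).

```python
def restrict(sequence, keep):
    """Drop from a sequence every atom number that is not in `keep`.

    sequence: list of tuples of atom numbers (one tuple per group).
    Groups left empty by the elimination disappear from the result.
    """
    reduced = []
    for group in sequence:
        kept = tuple(atom for atom in group if atom in keep)
        if kept:
            reduced.append(kept)
    return reduced

start_of_y = [(15,), (16,), (17,), (14,), (18,)]
print(restrict(start_of_y, {14, 16, 18}))  # [(16,), (14,), (18,)]
```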

5. Building the Dominance Partition Sequences Graph 

Constructing every DPS from G is algorithmically hopeless; instead, the approach developed
here is built upon the fact that DP sequences can be factorized on two levels:



4 So that no permutation set can be added to the sequence without intersecting one of the sets in it.

Figure 3

GDPS skeleton graph: the nodes are the permutation sequences from (13), with links towards adjacent
non-intersecting sequences.



first, as shown in the previous section, a GDPS can be factorized into pairs of connected adjacent
permutation sequences,

second, DPSs can be aggregated so that specific segments inside the sequence form permutation
sequences (hereafter abridged to PS) encoding great amounts of information [1-4]. Moreover,
a given PS frequently appears in many other aggregated DPSs.

Thus we can think of PSs as the nodes of a directed graph with incoming arcs from the
left-DPSs and outgoing arcs towards the right-DPSs. For the sake of simplicity we call this
graph D.

In what follows we proceed with the second factorization level. That is, a procedure is described 
for factoring linked adjacent pairs of PSs from the GDPS (13) into a directed graph of PS patterns. 
Later these graphs can be easily joined together. All the information needed to build D can be 
found in C and it can be done in 4 steps. 

The first step consists in finding the nodes of D. These can be found in the DPS set associated
with each simplex in C. As discussed in [5], the DPSs for each simplex in C are projections of
DPSs from the molecule. For instance, we have the DPS for the simplex with vertex numbers
{12,14,40,41}: {{(12 41)(40)(14)}_x, {(14)(12)(40)(41)}_y, {(12)(14 40 41)}_z}; a glance gives the
potential permutation sequence nodes (12 41)_x, (14 40 41)_z as well as (14)_x, (41)_x, (14)_y, ...

In the second step we need to determine, for each PS vertex, the positions it can have inside a
DPS. Again this information can be obtained from C: if for coordinate c we have a PS 𝒫, its
positions are determined with the procedure below.

Assume that for the coordinate c we have the PS (𝒫_χ)_c with set of indices χ. Let S_χ be a simplex
with vertex set V = {v_1, v_2, v_3, v_4} such that χ ⊂ V, and let 𝒮_χ be the adjacent^5 set of simplexes
such that the union of their vertex sets is 𝒩 and 𝒮_χ is minimal. Also, let DPS_{𝒫_χ} be the set of
DPSs from S_χ that have (𝒫_χ)_c as a subsequence.

Then the positions of (𝒫_χ)_c in the DPSs can be determined with the following recursive procedure.

Procedure 1.

1. Let LEFT = {} and POS = {},

2. for each Q ∈ DPS_{𝒫_χ} let n = 0,

3. for each S ∈ 𝒮_χ we select from its DPS set the subset DPS_Q of sequences compatible^6
with Q,

4. for each X_c ∈ DPS_Q let A be the set of indices to the left of (𝒫_χ)_c; then n = n + |A \ LEFT|
and LEFT = LEFT ∪ A,

5. POS = POS ∪ {n}, go to step 4.

The algorithm above has been purposely restricted to |χ| < 4 only for the sake of clarity; it can be
straightforwardly extended beyond.
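
The compatibility test used in step 3 of Procedure 1 (two DPSs from adjacent simplexes are compatible when they give the same sequence once reduced to their common indices) can be sketched as follows. The sketch assumes plain, non-aggregated sequences stored per axis; all names and the three example DPSs are hypothetical.

```python
def reduce_to(dps, keep):
    """Restrict a 3D DPS (dict axis -> list of atom numbers) to the atoms in `keep`."""
    return {axis: [a for a in seq if a in keep] for axis, seq in dps.items()}

def compatible(dps_a, dps_b):
    """True if both DPSs coincide once reduced to their common indices."""
    common = set(dps_a["x"]) & set(dps_b["x"])
    return reduce_to(dps_a, common) == reduce_to(dps_b, common)

# Hypothetical DPSs of adjacent simplexes sharing the vertices 12, 14 and 40.
a = {"x": [12, 41, 40, 14], "y": [14, 12, 40, 41], "z": [12, 14, 40, 41]}
b = {"x": [12, 40, 39, 14], "y": [14, 39, 12, 40], "z": [39, 12, 14, 40]}
c = {"x": [40, 12, 39, 14], "y": [14, 39, 12, 40], "z": [39, 12, 14, 40]}

print(compatible(a, b))  # True
print(compatible(a, c))  # False: 12 and 40 are swapped along x
```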

In the third step we seek to determine the connexions between the permutation sequences: two
PSs (𝒫_1)_c and (𝒫_2)_c with positions p_1 and p_2 such that p_2 = p_1 + |(𝒫_1)_c| can be adjacent
subsequences in some DPS. This can be checked with the following procedure:

5 That share three vertices.

6 Two DPSs from two adjacent simplexes that give the same sequence when reduced to their common
indices are said to be compatible [6].

Procedure 2.

1. Let χ_1 = {i_1, i_2, ..., i_n} and χ_2 = {j_1, j_2, ..., j_m} be the sets of indices for (𝒫_1)_c and (𝒫_2)_c
respectively,

2. let us define the sets of PSs X_1 and X_2:

{(i_α, i_β)_c ∈ X_1 : 1 ≤ α < n, α < β ≤ n} and
{(j_α, j_β)_c ∈ X_2 : 1 ≤ α < m, α < β ≤ m},

3. let us define the set 𝒮_{1,2} of (n choose 2) × (m choose 2) simplexes such that each S ∈ 𝒮_{1,2} has the set of
vertices {v_1, v_2, v_3, v_4} with two indices from χ_1 and the other two from χ_2,

4. for each X_{1αβ} ∈ X_1
for each X_{2αβ} ∈ X_2
let us select the simplex S ∈ 𝒮_{1,2} such that its set of vertices is X_{1αβ} ∪ X_{2αβ},

5. let Q be a sequence from the DPS set of S such that {(X_{1αβ})(X_{2αβ})}_c ⊂ Q; if Q does not
exist, the check is negative and we exit from the procedure,

6. the check is positive and the procedure is terminated.

The fourth step consists in transforming D into a hypergraph^7. The reason for this is that
DPSs in x, y and z are not independent: there are constraints arising from the 3D structure
of objects that result in connexions between PSs in one coordinate being associated with specific
connexions in another coordinate.
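
One possible way of storing such a structure is sketched below. It is only an illustration under stated assumptions: PS nodes are labelled by an (axis, indices) pair, and a hyperedge groups the PS nodes of connexions in different coordinates that occur together. The class, its methods and the example edge are hypothetical; the PSs are taken from the simplex {12,14,40,41} discussed in the first step.

```python
from collections import defaultdict

class Hypergraph:
    """Hypergraph whose edges are arbitrary non-empty subsets of vertices."""

    def __init__(self):
        self.edges = set()
        self.incidence = defaultdict(set)  # vertex -> hyperedges containing it

    def add_edge(self, vertices):
        edge = frozenset(vertices)
        if not edge:
            raise ValueError("a hyperedge needs at least one vertex")
        self.edges.add(edge)
        for vertex in edge:
            self.incidence[vertex].add(edge)
        return edge

    def edges_containing(self, vertex):
        return self.incidence.get(vertex, set())

h = Hypergraph()
# Connexion (12 41)(40) in x associated with connexion (12)(14 40 41) in z.
edge = h.add_edge([("x", (12, 41)), ("x", (40,)), ("z", (12,)), ("z", (14, 40, 41))])
print(len(edge), len(h.edges_containing(("x", (12, 41)))))  # 4 1
```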

The procedure for determining whether two connexions, say {(𝒫_1)(𝒫_2)}_{c1} and {(𝒬_1)(𝒬_2)}_{c2}, are
simultaneously present in any one 3D-DPS can be described as follows:

Procedure 3.

1. Let χ_{𝒫1} and χ_{𝒫2} be the sets of indices of (𝒫_1)_{c1} and (𝒫_2)_{c1} respectively, and
let χ_{𝒬1} and χ_{𝒬2} be the sets of indices of (𝒬_1)_{c2} and (𝒬_2)_{c2} respectively,

2. let 𝒮_{𝒫1,2} and 𝒮_{𝒬1,2} be sets of simplexes defined exactly as in step 3 of Procedure 2,

3. for each S_𝒫 ∈ 𝒮_{𝒫1,2} let Q_𝒫 be a DPS defined exactly as in step 5 of Procedure 2 for S_𝒫,

4. for each S_𝒫 we progressively substitute, one by one, the indices from the vertices χ_{𝒫1,2} by
the indices χ_{𝒬1,2} until S_𝒫 becomes some S_𝒬; this generates 4 simplexes: S^s (1 ≤ s ≤ 4),

5. for each intermediate stage we build the set DPS^s (1 ≤ s ≤ 4), starting with
DPS^1 = {Q_𝒫}; each DPS^s is the subset of DPSs from S^s that are compatible with those
from DPS^{s-1},

6. let DPS_{𝒬1,2} ⊆ DPS_𝒬 be the subset of sequences from S^4, each being defined as in step 5 of
Procedure 2. If DPS_{𝒬1,2} ∩ DPS^4 = ∅ the check is negative and we exit the procedure,

7. steps 3 to 6 are repeated for each S_𝒬 ∈ 𝒮_{𝒬1,2},

8. the check is positive and the procedure is terminated.

7 In ordinary graphs an edge is a pair of vertices. In hypergraphs an edge can be an arbitrary
subset of vertices [28].

6. Conclusion 

This paper addresses the algorithmic issues that arise in enumerating the sets of cells in CS
corresponding to the dynamical states of a molecule. More precisely, it introduces an important
structure, the graph of partition sequences, which allows the realizable partition
sequences, and hence the accessible cells in CS, to be enumerated.

This is possible because the generalized partition sequences permit a second
factorization of DPSs. The resulting reduction in algorithmic complexity makes the
construction of D feasible, as shown by the 3 procedures described in this work.

This is most important, since DPSs are a sort of skeleton of the 3D molecular structure,
and methods for reconstructing molecular structures from this skeleton are to be developed in
forthcoming works of this series.